Entity Resolution of Institutions in Bibliographic Databases

Authors

  • Jeffrey Fisher
  • Peter Christen
  • Qing Wang
  • Paul Wong
Abstract

Acknowledgements: Many people have assisted me in carrying out this project. Firstly, I would like to thank my academic supervisors, Associate Professor Peter Christen and Dr. Qing Wang, for their ideas, support, encouragement and feedback. I would also like to thank Dr. Paul Wong from the ANU Research Office for providing me with a place to work and helpful advice on the project itself and the SCOPUS database. I would like to thank my friends, in particular Swapnil Mishra and Anish Varghese, for their good humour and for helping me to keep things in perspective. Lastly, I would like to thank my family for their support and encouragement, and for being so understanding about all the family dinners I missed.

Abstract: Bibliographic databases are very important for a variety of tasks, including measuring the research output of institutions and predicting future areas of research interest. However, incorrect or incomplete data in such databases can compromise any analysis and lead to poor decision making and financial loss. In this project we have performed data matching of institution data in the SCOPUS Bibliographic Database. We used a variety of established data matching methods and adapted them to suit the particulars of the project. We describe our data cleaning work, including our novel approach for extracting institution names from the values of the organization attribute. We describe the data matching that we have undertaken, both in merging institutions where they have different identifiers but represent the same institution, and in assigning an identifier to records without one. We show that in the first case we can achieve a high coverage and maintain precision over 85%. In the second case the precision drops significantly beyond 40% coverage, and we examine reasons why this occurs. Finally, we present our conclusions along with some suggestions on how this work could be extended in the future.

Similar resources

Data Cleaning and Matching of Institutions in Bibliographic Databases

Bibliographic databases are very important for a variety of tasks for governments, academic institutions and businesses. These include assessing research output of institutions, performance evaluation of academics and compiling university rankings. However, incorrect or incomplete data in such databases can compromise any analysis and lead to poor decisions and financial loss. In this paper we ...

Comparison of Bibliographic Databases in Retrieving Information on Telemedicine

Background & Aims: Some of the main questions which can be of importance for researchers who intend to perform a systematic review in a field of science are: ‘What databases should I use for my review?’; ‘Do all these databases have the same value?’; and ‘Which sources retrieved the highest number of relevant references?’. The main aim of this work was the identification of the best database for ...

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Bibliographic Databases: Some Critical Points

Current flow of information necessitates a systematic approach to what authors, reviewers and editors read and use as references. The objectivity of communication is increasingly dependent on a comprehensive literature search through online databases (1). Academic institutions wishing to succeed in the global competition secure access to the prestigious databases and archives (2). Journal edito...

Publication date: 2013